Improving Search Algorithms by Using Intelligent Coordinates 
We consider the problem of designing a set of computational agents so that as they all pursue 
their self-interests a global function G of the collective system is optimized. Three factors govern the 
quality of such design. The first relates to conventional exploration-exploitation search algorithms 
for finding the maxima of such a global function, e.g., simulated annealing (SA). Game-theoretic 
algorithms instead are related to the second of those factors, and the third is related to techniques 
from the field of machine learning. Here we demonstrate how to exploit all three factors by modifying 
the search algorithm's exploration stage so that rather than by random sampling, each coordinate of 
the underlying search space is controlled by an associated machine-learning-based "player" engaged 
in a non-cooperative game. Experiments demonstrate that this modification improves SA by up to 
an order of magnitude for bin-packing and for a model of an economic process run over an underlying 
network. These experiments also reveal novel small worlds phenomena. 

PACS numbers: 89.20.Ff, 89.75.-k, 89.75.Fb, 02.60.Pn, 02.70.-c, 02.70.Tt 



I. INTRODUCTION 

Many distributed systems found in nature have in- 
spired function-maximization algorithms. In some of 
these the coordinates of the underlying system are viewed 
as players engaged in a non-cooperative game, whose 
joint behavior (hopefully) maximizes the pre-specified 
global function of the entire system. Examples of such 
systems are auctions and clearing of markets. Typically 
in the computer-based algorithms inspired by such 
"collectives" of players, each separate coordinate of the 
system is controlled by an associated machine learning 
algorithm, reinforcement-learning (RL) algorithms being 
particularly common. 

One important issue concerning such collectives is 
whether the payoff function $g_\eta$ of each player $\eta$ is 
sufficiently sensitive to the coordinates that $\eta$ controls, 
in comparison to the other coordinates, so that $\eta$ can 
learn how to control its coordinates to achieve high 
payoff. A second crucial issue is the need for all of the 
$g_\eta$ to be "aligned" with G, so that as the players 
individually learn how to increase their payoffs, G also 
increases. 

Previous work in the Collective INtelligence (COIN) 
framework addresses these issues. This work extends 
conventional game-theoretic mechanism design to 
include off-equilibrium behavior, learnability issues, 
$g_\eta$ with non-human attributes (e.g., $g_\eta$ for which 
incentive compatibility is irrelevant), and arbitrary G. 
In domains from network routing to congestion problems 
it outperforms traditional techniques, by up to several 
orders of magnitude for large systems. 

Other collective systems found in nature that have 
inspired search algorithms do not involve players 
conducting a non-cooperative game. Examples include spin 
glasses, genomes undergoing neo-Darwinian natural 
selection, and eusocial insect colonies, which have been 
translated into simulated annealing (SA), genetic 
algorithms, and swarm intelligence, respectively. An 
important issue here is the exploration/exploitation 
dynamics of the overall collective. 



Recent analysis reveals how G is governed by the inter- 
action between exploration/exploitation, the alignment 
of the $g_\eta$ and G, and the learnability of the $g_\eta$ [21]. Here 
we use that analysis to motivate a hybrid algorithm, In- 
telligent Coordinates for search (IC), that addresses all 
three issues. It works by modifying any exploration- 
based search algorithm so that each coordinate being 
searched is made "intelligent", its exploration value being 
the move of a game-playing computer algorithm rather 
than the random sample of a probability distribution. 

Like SA, IC is intended to be used as an "off the shelf" 
algorithm; rarely will it be the best possible algorithm 
for some particular domain. Rather it is designed for use 
in very large problems where parallelization can provide 
a large advantage, while there is little exploitable infor- 
mation concerning gradients. We present experiments 
comparing IC and SA on two archetypal domains: bin- 
packing and an economic model of people choosing for- 
mats for their home music systems. 

In the bin-packing domain IC achieves a given value of 
G up to three orders of magnitude faster than does SA, 
with the improvement increasing linearly with the size of 
the problem. In the format choice problem G is the sum 
of each person's "happiness" with her format choices. 
Each person $\eta$'s happiness with each of her choices is 
set by three factors: which of her nearest neighbors on 
a ring network ($\eta$'s "friends") make that choice; $\eta$'s 
intrinsic preference for that choice; and the price of music 
purchased in that format, inversely proportional to the 
total number of players using that choice. Here again, IC 
improves G two orders of magnitude more quickly than 
does SA. We also considered an algorithm similar to the 
Groves mechanism of economics; IC outperformed it by 
over two orders of magnitude. We also modified the ring 
to be a small-worlds network. This barely 
improved IC's performance (3%), with no effect on the 
other algorithms. However if G was also changed, so that 
each $\eta$'s happiness depends on agreeing with her friends' 
friends, the performance increase in changing to a 
small-worlds topology is significant (10%). This underscores 
the multiplicity of factors behind the benefits of 
small-worlds networks. 



II. SIMPLIFIED THEORY OF COLLECTIVES 

Let $z \in \zeta$ be the joint move of all agents/players in the 
collective. We want the $z$ that maximizes the provided 
world utility $G(z)$. In addition to $G$ we have private 
utility functions $\{g_\eta\}$, one for each agent $\eta$ controlling 
$z_\eta$. The subscript $\hat{\eta}$ refers to all agents other than $\eta$. 

Intelligence "standardizes" utility functions so that 
the value they assign to $z$ only reflects their ranking of $z$ 
relative to some other $z'$. One form of it is 

$$N_{\eta,U}(z) = \int d\mu_{z_{\hat{\eta}}}(z')\, \Theta[U(z) - U(z')] , \qquad (1)$$

where $\Theta$ is the Heaviside function, and where the 
subscript on the (normalized) measure $d\mu$ indicates it is 
restricted to $z'$ such that $z'_{\hat{\eta}} = z_{\hat{\eta}}$. 
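
As a rough illustration of Eq. (1), the sketch below evaluates an intelligence for a finite move set. It assumes a uniform measure over the agent's moves and the convention that the Heaviside function equals 1 at zero; neither choice is specified in the text, and the names `intelligence` and `toy_utility` are hypothetical.

```python
from typing import Callable, Sequence, Tuple

JointMove = Tuple[int, ...]


def intelligence(utility: Callable[[JointMove], float],
                 z: JointMove,
                 agent: int,
                 moves: Sequence[int]) -> float:
    """Fraction of the agent's candidate moves that do no better than z.

    Assumptions (not from the source): finite move set, uniform measure,
    and Heaviside(0) = 1.
    """
    u_actual = utility(z)
    count = 0
    for m in moves:
        # Replace only this agent's move; all other agents are held fixed.
        z_alt = z[:agent] + (m,) + z[agent + 1:]
        if u_actual - utility(z_alt) >= 0:
            count += 1
    return count / len(moves)


def toy_utility(z: JointMove) -> float:
    """Hypothetical utility: the number of distinct moves in use."""
    return float(len(set(z)))


if __name__ == "__main__":
    # Agent 0 could do better by switching to move 1, so its value is below 1.
    print(intelligence(toy_utility, (0, 0, 2), agent=0, moves=[0, 1, 2]))
```

Under these assumptions a value of 1 means the agent's move is a best response to the other agents' moves.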

Our uncertainty concerning the system induces a dis- 
tribution P(z). All attributes of the collective we can 
set, e.g., the private utility functions of the agents, are 
given by the value of the design coordinate s. Bayes 
theorem provides the central equation: 

$$P(G \mid s) = \int d\vec{N}_G\, P(G \mid \vec{N}_G, s) \int d\vec{N}_g\, P(\vec{N}_G \mid \vec{N}_g, s)\, P(\vec{N}_g \mid s) , \qquad (2)$$

where $\vec{N}_g$ and $\vec{N}_G$ are the intelligence vectors for all the 
agents, for the private utilities $g_\eta$ and for $G$, respectively. 
$N_{\eta,g_\eta}(z) = 1$ means that agent $\eta$'s move maximizes its 
utility, given the moves of the other agents. So 
$\vec{N}_g(z) = \vec{1}$ means $z$ is a Nash equilibrium. Conversely, 
$\vec{N}_G(z') = \vec{1}$ means that the value of $G$ cannot increase 
in moving from $z'$ along any single (sic) coordinate of $\zeta$. 
So if these two points are identical, then if the agents do 
well enough at maximizing their private utilities they 
must be near an (on-axis) maximizing point for $G$. 

More formally, say for our $s$ the third conditional 
probability in the integrand in the central equation 
("term 3") is peaked near $\vec{N}_g = \vec{1}$. Then $s$ probably 
induces large (private utility function) intelligences 
(intuitively, the utilities are learnable). If in addition the 
second term is peaked near $\vec{N}_G = \vec{N}_g$, then $\vec{N}_G$ will 
also be large (intuitively, the private utility is "aligned 
with $G$"). This peakedness is assured if $\vec{N}_g = \vec{N}_G$ 
exactly $\forall z$. Such a system is said to be factored. Finally, 
if the first term in the integrand is peaked about high $G$ 
when $\vec{N}_G$ is large, then $s$ probably induces high $G$, as 
desired. 

As a trivial example, a team game, where $g_\eta = G\ \forall \eta$, 
is factored. However team games usually have poor 
third terms, especially in large collectives. This is 
because each $\eta$ has to discern how its moves affect 
$g_\eta = G$, given the background of the (varying) moves of 
the other agents whose moves comparably affect $G$. 



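Fix some $f(z_\eta)$, two moves $z_\eta^1$ and $z_\eta^2$, a utility $U$, a 
value $s$, and a $z_{\hat{\eta}}$. The associated learnability is 

$$\Lambda_f(U; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2) = \sqrt{\frac{[E(U; z_{\hat{\eta}}, z_\eta^1) - E(U; z_{\hat{\eta}}, z_\eta^2)]^2}{\int dz_\eta\, [f(z_\eta)\, \mathrm{Var}(U; z_{\hat{\eta}}, z_\eta)]}} . \qquad (3)$$

The averages and variance here are evaluated according 
to $P(U \mid n_\eta) P(n_\eta \mid z_{\hat{\eta}}, z_\eta^1)$, $P(U \mid n_\eta) P(n_\eta \mid z_{\hat{\eta}}, z_\eta)$, and 
$P(U \mid n_\eta) P(n_\eta \mid z_{\hat{\eta}}, z_\eta^2)$, respectively, where $n_\eta$ is $\eta$'s 
training set, formed by sampling $U$. 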

The denominator in Eq. (3) reflects the sensitivity of 
$U(z)$ to $z_{\hat{\eta}}$, while the numerator reflects its 
sensitivity to $z_\eta$. So the greater the learnability of $g_\eta$, the 
more $g_\eta(z)$ depends only on the move of agent $\eta$, i.e., 
the more learnable $g_\eta$ is. More formally, it can be shown 
that if appropriately scaled, $g'_\eta$ will result in better 
expected intelligence for agent $\eta$ than will $g_\eta$ whenever 
$\Lambda_f(g'_\eta; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2) > \Lambda_f(g_\eta; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2)$ for all 
pairs of moves $z_\eta^1, z_\eta^2$. 

A difference utility is one of the form $U(z) = G(z) - D(z_{\hat{\eta}})$. 
Any difference utility is factored. In addition, under 
usually benign approximations, the $D(z_{\hat{\eta}})$ 
that maximizes $\Lambda_f(U; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2)$ for all pairs $z_\eta^1, z_\eta^2$ 
is $E_f(G(z) \mid z_{\hat{\eta}}, s)$, where the expectation value is over 
$z_\eta$. The associated difference utility is called the 
Aristocrat utility (AU). If each $\eta$ uses its AU as its private 
utility, then we have both good terms 2 and 3. 

Evaluating the expectation value in AU can be difficult 
in practice. This motivates the Wonderful Life Utility 
(WLU), which requires no such evaluation: 

$$\mathrm{WLU}_\eta = G(z) - G(z_{\hat{\eta}}, CL_\eta) , \qquad (4)$$

where $CL_\eta$ is the clamping parameter. WLU is 
factored, independent of the clamping parameter. 
Furthermore, while not matching AU, WLU typically has far 
better learnability than does a team game, and therefore 
typically results in better values of $G$. It is also often 
easier to evaluate than is $G$ itself [21]. 
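
A minimal sketch of the Wonderful Life Utility of Eq. (4) may help make the clamping operation concrete. Here clamping is modeled by replacing the agent's move with `None` (the item is removed), which is one possible reading of "clamping to 0"; the toy world utility `occupied_bins` and the function name `wlu` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Optional, Tuple

JointMove = Tuple[Optional[int], ...]


def occupied_bins(z: JointMove) -> int:
    """Toy world utility: number of distinct bins in use (None = no item)."""
    return len({b for b in z if b is not None})


def wlu(G: Callable[[JointMove], float],
        z: JointMove,
        agent: int,
        clamp: Optional[int] = None) -> float:
    """G(z) minus G evaluated with this agent's move clamped to a fixed value."""
    z_clamped = z[:agent] + (clamp,) + z[agent + 1:]
    return G(z) - G(z_clamped)


if __name__ == "__main__":
    z = (0, 0, 1)  # items 0 and 1 share bin 0, item 2 sits alone in bin 1
    print([wlu(occupied_bins, z, a) for a in range(len(z))])
```

In this toy setting an agent's WLU is nonzero only when its item is the sole occupant of a bin, so the signal each agent receives depends mainly on its own move, which is the learnability advantage over a team game described above.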

One way to address term 1 as well as 2 and 3 is to 
incorporate exploration/exploitation techniques like SA. 



III. EXPERIMENTS 

In our version of SA, at the beginning of each time-step 
$t$ a distribution $h_\eta(z_\eta)$ is formed for every $\eta$ by 
allotting probability 75% to the move $\eta$ had at the end 
of the preceding time-step, $z_{\eta,t-1}$, and uniformly 
dividing probability 25% across all of its other moves. The 
"exploration" joint-move $z_{expl}$ is then formed by 
simultaneously sampling all the $h_\eta$. If $G(z_{expl}) > G(z_{t-1})$, $z_t$ 
is set to $z_{expl}$. Otherwise $z_t$ is set by sampling a 
Boltzmann distribution having energies $G(z_{t-1})$ and $G(z_{expl})$. 
Many different annealing schedules were investigated; all 
results below are for the best schedules found. 
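
The SA step described above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes G is being maximized, that moves are integers `0..n_moves-1` with `n_moves >= 2`, and that the two-state Boltzmann distribution assigns the exploration move probability proportional to `exp(G / T)`. All function names are hypothetical.

```python
import math
import random


def sample_exploration(prev_moves, n_moves, keep_prob=0.75, rng=random):
    """Per coordinate: keep the previous move with probability keep_prob,
    otherwise pick uniformly among the other moves."""
    z = []
    for m in prev_moves:
        if rng.random() < keep_prob:
            z.append(m)
        else:
            others = [a for a in range(n_moves) if a != m]
            z.append(rng.choice(others))
    return tuple(z)


def accept_probability(g_prev, g_expl, temperature):
    """Probability of moving to the exploration joint-move (maximization)."""
    if g_expl > g_prev:
        return 1.0
    delta = (g_prev - g_expl) / temperature
    if delta > 700.0:  # exp would overflow; the probability is effectively 0
        return 0.0
    # Two-state Boltzmann distribution over the previous and exploration moves.
    return 1.0 / (1.0 + math.exp(delta))


def sa_step(G, z_prev, n_moves, temperature, rng=random):
    """One exploration/exploitation step of the SA variant described above."""
    z_expl = sample_exploration(z_prev, n_moves, rng=rng)
    if rng.random() < accept_probability(G(z_prev), G(z_expl), temperature):
        return z_expl
    return z_prev
```

For a minimization problem such as bin-packing, the same sketch applies with the sign of G flipped.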
IC is identical except that each $h_\eta$ is replaced by 

$$\frac{h_\eta(z_\eta)\, c_{\eta,t}(z_\eta)}{\sum_{a_\eta} h_\eta(a_\eta)\, c_{\eta,t}(a_\eta)} ,$$

where the distribution $c_{\eta,t}$ is set by an 
RL algorithm trying to optimize payoffs $g_\eta$. Here RL is 
done using a training set $n_{\eta,t}$ of all preceding move-payoff 
pairs, $\{(z_{\eta,t'}, g_\eta(z_{t'})) : t' < t\}$. For each possible move by 
$\eta$ one forms the weighted average of the payoffs recorded 
in $n_{\eta,t}$ that occurred with that move, where the weights 
decay exponentially in $t - t'$. $c_{\eta,t}$ then is the Boltzmann 
distribution, parameterized by a "learning temperature" 
(that effectively rescales $g_\eta$), over those averages. 
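
One way to realize the per-coordinate IC distribution is sketched below. The decay base (`decay=0.9`) and the default average assigned to moves that have never been tried (`default=0.0`) are assumptions, since the text does not give them; the sketch also assumes higher payoffs are preferred. The names `rl_distribution` and `ic_distribution` are hypothetical.

```python
import math


def rl_distribution(history, n_moves, t, learning_temp=0.2,
                    decay=0.9, default=0.0):
    """Boltzmann distribution over exponentially weighted payoff averages.

    history: iterable of (t_prime, move, payoff) with t_prime < t.
    decay and default are illustrative assumptions, not values from the text.
    """
    num = [0.0] * n_moves
    den = [0.0] * n_moves
    for t_prime, move, payoff in history:
        w = decay ** (t - t_prime)  # weight decays exponentially in t - t'
        num[move] += w * payoff
        den[move] += w
    avg = [num[a] / den[a] if den[a] > 0 else default for a in range(n_moves)]
    top = max(avg)  # subtract the maximum for numerical stability
    weights = [math.exp((v - top) / learning_temp) for v in avg]
    total = sum(weights)
    return [w / total for w in weights]


def ic_distribution(h, c):
    """Normalized product of the exploration distribution h and the RL distribution c."""
    prod = [hi * ci for hi, ci in zip(h, c)]
    total = sum(prod)
    return [p / total for p in prod]


if __name__ == "__main__":
    history = [(0, 0, 1.0), (1, 1, 0.0), (2, 0, 1.0)]
    c = rl_distribution(history, n_moves=2, t=3)
    print(ic_distribution([0.75, 0.25], c))
```

Sampling each coordinate from `ic_distribution` instead of from `h` alone is what biases exploration toward moves that have historically paid well for that coordinate.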

In all our experiments the "AU" version of IC 
approximated $f$ to be uniform $\forall \eta$, and then used a 
mean-field approximation to pull the expectation inside $G$. 
Unless otherwise specified, the clamping elements used in 
WLUs were set to 0. 

In bin-packing $N$ items, all of size < c, must be 
assigned into a minimal subset of $N$ bins, without assigning 
a summed size > c to any one bin. $G$ of an assignment 
pattern is the number of occupied bins, and each agent 
controls the bin choice of one item. To improve 
performance all algorithms use a modified "G", $G_{soft}$, even 
though their performance is measured with $G$: 



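$$G_{soft} = \sum_{i=1}^{N} \begin{cases} \left(\frac{c}{2}\right)^2 - \left(x_i - \frac{c}{2}\right)^2 & \text{if } x_i \le c \\ \ldots & \text{if } x_i > c \end{cases} \qquad (5)$$

where $x_i$ is the summed size of all items in bin $i$. (Use of 
$G_{soft}$ encourages bins to be either full or empty.) 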

In the IC runs learning temperature was 0.2, and all 
agents made the transition to RL-based moves after a 
period of 100 random $z$'s used to generate the starting 
$n_\eta$. Exploitation temperature started at 0.5 for all 
algorithms, and was multiplied by 0.8 every 100 exploitation 
time-steps. In each SA run, the distribution $h$ was slowly 
modified to generate solutions that differed in fewer items 
than the current solution as time progressed. 
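
The exploitation-temperature schedule quoted above (start at 0.5, multiply by 0.8 every 100 exploitation time-steps) can be written as a one-line function. The sketch assumes steps are counted from 0 and that the reduction takes effect once each block of 100 steps is complete; the function name is hypothetical.

```python
def exploitation_temperature(step, start=0.5, factor=0.8, period=100):
    """Geometric annealing: multiply by `factor` after every `period` steps."""
    return start * factor ** (step // period)


if __name__ == "__main__":
    for s in (0, 99, 100, 250):
        print(s, exploitation_temperature(s))
```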



Algorithm   Ave. G        Best  Worst  % Optimum
IC WLU      3.32 ± 0.22   2     8      72 %
IC TG       7.84 ± 0.17   6     10     0 %
COIN WLU    3.52 ± 0.20   2     7      64 %
COIN TG     7.84 ± 0.15   6     9      0 %
SA          6.00 ± 0.19   4     7      0 %

TABLE I: Bin-packing G at time 200 for N = 20, c = 12. 

In Table I "Best" refers to the best end-of-run $G$ 
achieved by the associated algorithm, "Worst" is the 
worst value, and "% Optimum" is the percentage of runs 
that were within one bin of the best value. Fig. 1 shows 
average performances (over 25 runs) as a function of time 
step. The algorithms that account for both terms 2 
and 3, IC WLU and COIN WLU, far outperform 
the others, with the algorithm accounting for all three 
terms doing best. The worst algorithms were those that 
accounted for only a single term (SA and COIN TG). 



Linearly (i.e., optimistically) extrapolating SA's 
performance from time 15000 indicates it would take over 1000 
times as long as IC WLU to reach the $G$ value IC WLU 
reaches at time 200. In addition the ratio of WLU's time 
1000 performance (relative to random search) to SA's 
grows linearly with the size of the problem. Finally, Fig. 2 
illustrates that the benefit of addressing terms 2 and 3 
grows with the difficulty of the problem. In both figures 
SA outperforms IC TG; this is due to there being more 
parameter-tuning with SA. 

FIG. 1: Average bin-packing G for N = 50, c = 10. All error 
bars < 0.31 except IC AU and COIN AU are < 0.57. 

FIG. 2: G vs. c for N = 20 at t = 200. All error bars < 0.34. 

For the format choice problem $G$ is the sum over all 
$N_a$ agents $\eta$ of $\eta$'s "happiness" with its music formats: 

$$G = \sum_{\eta=1}^{N_a} \sum_{i=1}^{N_f} \sum_{\eta' \in \mathrm{neigh}_\eta} w_{i,\eta,\eta'}\, \mathrm{pref}_{\eta,i}\, h_i , \qquad (6)$$

where $N_f$ is the number of formats; $\mathrm{neigh}_\eta$ is the set 
of players lying $\le D$ hops away from player $\eta$; $\mathrm{pref}_{\eta,i}$ 
is $\eta$'s intrinsic preference for format $i$ (set randomly at 
initialization $\in [0, 1]$); $h_i$ is the total number of players 
that choose format $i$ (i.e., the inverse price for format $i$); 
and $w_{i,\eta,\eta'} = 1$ if the choices of players $\eta$ and $\eta'$ both 
include the format $i$, and 0 otherwise (each agent's move 
is a selection of three of four total formats, implemented 
by choosing the one format not to be used). $D$ values of 
both 1 and 3 were investigated. 

FIG. 3: G(t = 200) for 100 agents. In order from left to right, 
D = {1, 1, 3, 3}, and topologies are {L,W,L,W}. 
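
To make the format-choice world utility of Eq. (6) concrete, the following sketch evaluates it on an arbitrary undirected network. It assumes that a player's neighborhood excludes the player itself and contains every other player within `max_hops` hops, and it encodes each move as the index of the one excluded format; these encodings and the function names are illustrative assumptions.

```python
from collections import deque


def neighbors_within(adj, source, max_hops):
    """Players reachable from `source` in 1..max_hops hops (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue  # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[source]  # assumption: a player is not its own neighbor
    return set(dist)


def format_choice_G(excluded, prefs, adj, max_hops, n_formats=4):
    """Sum over players, shared formats and neighbors of preference times popularity.

    excluded[a] is the single format player a does NOT use;
    prefs[a][i] is player a's intrinsic preference for format i.
    """
    n_agents = len(excluded)
    chosen = [set(range(n_formats)) - {e} for e in excluded]
    # Number of players using each format (the inverse price in the text).
    counts = [sum(i in c for c in chosen) for i in range(n_formats)]
    total = 0.0
    for a in range(n_agents):
        for b in neighbors_within(adj, a, max_hops):
            for i in chosen[a] & chosen[b]:
                total += prefs[a][i] * counts[i]
    return total
```

For example, on a three-player ring where every player excludes format 3 and all preferences equal 1, each player shares three formats with each of two neighbors and every shared format is used by three players, giving a total of 3 x 2 x 3 x 3 = 54.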

In Fig. 3, "IC Econ" refers to WLU IC where clamping 
means the agent chooses no format whatsoever. It is 
essentially the game-theory Groves mechanism wherein one 
sets $g_\eta$ to $\eta$'s marginal contribution to $G$, here rescaled 
and interleaved with a simulated annealing step to 
improve performance. "IC-WLU" instead clamps $\eta$'s move 
to zero (in accord with the theory of collectives), which 
means that $\eta$ chooses all formats. Learning temperature 
was now 0.4, and exploitation temperature was 0.05 
(annealing provided no advantage since runs were short). 
Two network topologies were investigated. Both were 
$m$-node rings with an extra 0.06$m$ random links added, a 
new such set for each of the 50 runs giving a plotted 
average value. "Short links" (L) means that all extra links 
connected players two hops apart, while "small-worlds" 
(W) means there was no such restriction. 
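
The two topologies can be generated along the following lines. This is a sketch under assumptions the text leaves open: the number of extra links is `round(0.06 * m)`, duplicate links are rejected and redrawn, and "two hops apart" is measured on the underlying ring. The function name and parameters are hypothetical.

```python
import random


def ring_with_extra_links(m, short_links, extra_fraction=0.06, rng=random):
    """Return an adjacency dict for an m-node ring plus extra random links.

    short_links=True  -> extra links join nodes two ring-hops apart ("L").
    short_links=False -> extra links join arbitrary distinct nodes ("W").
    """
    if m < 5:
        raise ValueError("need at least 5 nodes for distinct two-hop links")
    edges = {frozenset((i, (i + 1) % m)) for i in range(m)}
    target = len(edges) + round(extra_fraction * m)
    while len(edges) < target:
        a = rng.randrange(m)
        b = (a + 2) % m if short_links else rng.randrange(m)
        if a != b:
            edges.add(frozenset((a, b)))  # duplicates leave the set unchanged
    adj = {i: set() for i in range(m)}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    return adj


if __name__ == "__main__":
    adj = ring_with_extra_links(100, short_links=False, rng=random.Random(0))
    print(sum(len(v) for v in adj.values()) // 2)  # total number of links
```

Drawing a fresh graph for each run, as the text describes, corresponds to calling the function once per run with a different random state.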

IC Econ's inferior performance illustrates the 
shortcoming of economics-like algorithms. For $D = 1$ SA did 
not benefit from small-worlds connections, and IC 
variants barely benefited (3%), despite the associated drop 
in average inter-node hop distance. However if $D$ also 
increased, so that $G$ directly reflected the change in the 
topology, then the gain with a small-worlds topology grew 
to 10%. (See the discussion on path lengths in [14].) 

The authors thank Michael New, Bill Macready, and 
Charlie Strauss for helpful comments. 